House Price Prediction Dataset
House price prediction is the process of estimating the current or future value of a residential property from its characteristics and historical data. The goal is to forecast the price of a house given its features and the state of the real estate market.
Let's import the libraries. We'll use the matplotlib.pyplot module for basic plots such as line and bar charts; it is usually imported with the alias plt. We'll use the seaborn module for more advanced plots; it is commonly imported with the alias sns.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore') # suppress warnings, e.g. those caused by library version mismatches
Load dataset
df=pd.read_csv('data (1).csv')
df.head()
| | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | sqft_above | sqft_basement | yr_built | yr_renovated | street | city | statezip | country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014-05-02 00:00:00 | 313000.0 | 3.0 | 1.50 | 1340 | 7912 | 1.5 | 0 | 0 | 3 | 1340 | 0 | 1955 | 2005 | 18810 Densmore Ave N | Shoreline | WA 98133 | USA |
| 1 | 2014-05-02 00:00:00 | 2384000.0 | 5.0 | 2.50 | 3650 | 9050 | 2.0 | 0 | 4 | 5 | 3370 | 280 | 1921 | 0 | 709 W Blaine St | Seattle | WA 98119 | USA |
| 2 | 2014-05-02 00:00:00 | 342000.0 | 3.0 | 2.00 | 1930 | 11947 | 1.0 | 0 | 0 | 4 | 1930 | 0 | 1966 | 0 | 26206-26214 143rd Ave SE | Kent | WA 98042 | USA |
| 3 | 2014-05-02 00:00:00 | 420000.0 | 3.0 | 2.25 | 2000 | 8030 | 1.0 | 0 | 0 | 4 | 1000 | 1000 | 1963 | 0 | 857 170th Pl NE | Bellevue | WA 98008 | USA |
| 4 | 2014-05-02 00:00:00 | 550000.0 | 4.0 | 2.50 | 1940 | 10500 | 1.0 | 0 | 0 | 4 | 1140 | 800 | 1976 | 1992 | 9105 170th Ave NE | Redmond | WA 98052 | USA |
Basic Visualization
df.shape # check the number of rows and columns
(4600, 18)
df.head() # display the first 5 rows of the DataFrame
| | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | sqft_above | sqft_basement | yr_built | yr_renovated | street | city | statezip | country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014-05-02 00:00:00 | 313000.0 | 3.0 | 1.50 | 1340 | 7912 | 1.5 | 0 | 0 | 3 | 1340 | 0 | 1955 | 2005 | 18810 Densmore Ave N | Shoreline | WA 98133 | USA |
| 1 | 2014-05-02 00:00:00 | 2384000.0 | 5.0 | 2.50 | 3650 | 9050 | 2.0 | 0 | 4 | 5 | 3370 | 280 | 1921 | 0 | 709 W Blaine St | Seattle | WA 98119 | USA |
| 2 | 2014-05-02 00:00:00 | 342000.0 | 3.0 | 2.00 | 1930 | 11947 | 1.0 | 0 | 0 | 4 | 1930 | 0 | 1966 | 0 | 26206-26214 143rd Ave SE | Kent | WA 98042 | USA |
| 3 | 2014-05-02 00:00:00 | 420000.0 | 3.0 | 2.25 | 2000 | 8030 | 1.0 | 0 | 0 | 4 | 1000 | 1000 | 1963 | 0 | 857 170th Pl NE | Bellevue | WA 98008 | USA |
| 4 | 2014-05-02 00:00:00 | 550000.0 | 4.0 | 2.50 | 1940 | 10500 | 1.0 | 0 | 0 | 4 | 1140 | 800 | 1976 | 1992 | 9105 170th Ave NE | Redmond | WA 98052 | USA |
df = df.drop_duplicates()
df.head()
| | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | sqft_above | sqft_basement | yr_built | yr_renovated | street | city | statezip | country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014-05-02 00:00:00 | 313000.0 | 3.0 | 1.50 | 1340 | 7912 | 1.5 | 0 | 0 | 3 | 1340 | 0 | 1955 | 2005 | 18810 Densmore Ave N | Shoreline | WA 98133 | USA |
| 1 | 2014-05-02 00:00:00 | 2384000.0 | 5.0 | 2.50 | 3650 | 9050 | 2.0 | 0 | 4 | 5 | 3370 | 280 | 1921 | 0 | 709 W Blaine St | Seattle | WA 98119 | USA |
| 2 | 2014-05-02 00:00:00 | 342000.0 | 3.0 | 2.00 | 1930 | 11947 | 1.0 | 0 | 0 | 4 | 1930 | 0 | 1966 | 0 | 26206-26214 143rd Ave SE | Kent | WA 98042 | USA |
| 3 | 2014-05-02 00:00:00 | 420000.0 | 3.0 | 2.25 | 2000 | 8030 | 1.0 | 0 | 0 | 4 | 1000 | 1000 | 1963 | 0 | 857 170th Pl NE | Bellevue | WA 98008 | USA |
| 4 | 2014-05-02 00:00:00 | 550000.0 | 4.0 | 2.50 | 1940 | 10500 | 1.0 | 0 | 0 | 4 | 1140 | 800 | 1976 | 1992 | 9105 170th Ave NE | Redmond | WA 98052 | USA |
Renaming Columns
df.rename(columns={'floors': 'levels'}).head(5)
| | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | levels | waterfront | view | condition | sqft_above | sqft_basement | yr_built | yr_renovated | street | city | statezip | country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014-05-02 00:00:00 | 313000.0 | 3.0 | 1.50 | 1340 | 7912 | 1.5 | 0 | 0 | 3 | 1340 | 0 | 1955 | 2005 | 18810 Densmore Ave N | Shoreline | WA 98133 | USA |
| 1 | 2014-05-02 00:00:00 | 2384000.0 | 5.0 | 2.50 | 3650 | 9050 | 2.0 | 0 | 4 | 5 | 3370 | 280 | 1921 | 0 | 709 W Blaine St | Seattle | WA 98119 | USA |
| 2 | 2014-05-02 00:00:00 | 342000.0 | 3.0 | 2.00 | 1930 | 11947 | 1.0 | 0 | 0 | 4 | 1930 | 0 | 1966 | 0 | 26206-26214 143rd Ave SE | Kent | WA 98042 | USA |
| 3 | 2014-05-02 00:00:00 | 420000.0 | 3.0 | 2.25 | 2000 | 8030 | 1.0 | 0 | 0 | 4 | 1000 | 1000 | 1963 | 0 | 857 170th Pl NE | Bellevue | WA 98008 | USA |
| 4 | 2014-05-02 00:00:00 | 550000.0 | 4.0 | 2.50 | 1940 | 10500 | 1.0 | 0 | 0 | 4 | 1140 | 800 | 1976 | 1992 | 9105 170th Ave NE | Redmond | WA 98052 | USA |
df.rename(columns={'price': 'cost'}).head(10)
| | date | cost | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | sqft_above | sqft_basement | yr_built | yr_renovated | street | city | statezip | country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2014-05-02 00:00:00 | 313000.0 | 3.0 | 1.50 | 1340 | 7912 | 1.5 | 0 | 0 | 3 | 1340 | 0 | 1955 | 2005 | 18810 Densmore Ave N | Shoreline | WA 98133 | USA |
| 1 | 2014-05-02 00:00:00 | 2384000.0 | 5.0 | 2.50 | 3650 | 9050 | 2.0 | 0 | 4 | 5 | 3370 | 280 | 1921 | 0 | 709 W Blaine St | Seattle | WA 98119 | USA |
| 2 | 2014-05-02 00:00:00 | 342000.0 | 3.0 | 2.00 | 1930 | 11947 | 1.0 | 0 | 0 | 4 | 1930 | 0 | 1966 | 0 | 26206-26214 143rd Ave SE | Kent | WA 98042 | USA |
| 3 | 2014-05-02 00:00:00 | 420000.0 | 3.0 | 2.25 | 2000 | 8030 | 1.0 | 0 | 0 | 4 | 1000 | 1000 | 1963 | 0 | 857 170th Pl NE | Bellevue | WA 98008 | USA |
| 4 | 2014-05-02 00:00:00 | 550000.0 | 4.0 | 2.50 | 1940 | 10500 | 1.0 | 0 | 0 | 4 | 1140 | 800 | 1976 | 1992 | 9105 170th Ave NE | Redmond | WA 98052 | USA |
| 5 | 2014-05-02 00:00:00 | 490000.0 | 2.0 | 1.00 | 880 | 6380 | 1.0 | 0 | 0 | 3 | 880 | 0 | 1938 | 1994 | 522 NE 88th St | Seattle | WA 98115 | USA |
| 6 | 2014-05-02 00:00:00 | 335000.0 | 2.0 | 2.00 | 1350 | 2560 | 1.0 | 0 | 0 | 3 | 1350 | 0 | 1976 | 0 | 2616 174th Ave NE | Redmond | WA 98052 | USA |
| 7 | 2014-05-02 00:00:00 | 482000.0 | 4.0 | 2.50 | 2710 | 35868 | 2.0 | 0 | 0 | 3 | 2710 | 0 | 1989 | 0 | 23762 SE 253rd Pl | Maple Valley | WA 98038 | USA |
| 8 | 2014-05-02 00:00:00 | 452500.0 | 3.0 | 2.50 | 2430 | 88426 | 1.0 | 0 | 0 | 4 | 1570 | 860 | 1985 | 0 | 46611-46625 SE 129th St | North Bend | WA 98045 | USA |
| 9 | 2014-05-02 00:00:00 | 640000.0 | 4.0 | 2.00 | 1520 | 6200 | 1.5 | 0 | 0 | 3 | 1520 | 0 | 1945 | 2010 | 6811 55th Ave NE | Seattle | WA 98115 | USA |
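The two rename calls above only display a renamed copy; they do not change df, which is why df.columns later still lists price and floors. A minimal sketch of that behaviour, using a small hypothetical frame (df_demo) built from two rows of the head() output rather than the full CSV:

```python
import pandas as pd

# Two rows copied from the head() output; only three columns kept for brevity.
# df_demo is a hypothetical stand-in for df.
df_demo = pd.DataFrame({
    "price": [313000.0, 2384000.0],
    "bedrooms": [3.0, 5.0],
    "floors": [1.5, 2.0],
})

# rename() returns a new DataFrame; df_demo itself is left untouched.
renamed = df_demo.rename(columns={"floors": "levels", "price": "cost"})

print(list(df_demo.columns))  # original names
print(list(renamed.columns))  # names in the renamed copy

# To keep a new name, assign the result back.
df_demo = df_demo.rename(columns={"floors": "levels"})
print(list(df_demo.columns))
```

Assigning the result back (or keeping it in a new variable) is usually enough when the renamed columns are needed in later cells.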
df.tail() #provides last 5 samples
| | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | sqft_above | sqft_basement | yr_built | yr_renovated | street | city | statezip | country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 4595 | 2014-07-09 00:00:00 | 308166.666667 | 3.0 | 1.75 | 1510 | 6360 | 1.0 | 0 | 0 | 4 | 1510 | 0 | 1954 | 1979 | 501 N 143rd St | Seattle | WA 98133 | USA |
| 4596 | 2014-07-09 00:00:00 | 534333.333333 | 3.0 | 2.50 | 1460 | 7573 | 2.0 | 0 | 0 | 3 | 1460 | 0 | 1983 | 2009 | 14855 SE 10th Pl | Bellevue | WA 98007 | USA |
| 4597 | 2014-07-09 00:00:00 | 416904.166667 | 3.0 | 2.50 | 3010 | 7014 | 2.0 | 0 | 0 | 3 | 3010 | 0 | 2009 | 0 | 759 Ilwaco Pl NE | Renton | WA 98059 | USA |
| 4598 | 2014-07-10 00:00:00 | 203400.000000 | 4.0 | 2.00 | 2090 | 6630 | 1.0 | 0 | 0 | 3 | 1070 | 1020 | 1974 | 0 | 5148 S Creston St | Seattle | WA 98178 | USA |
| 4599 | 2014-07-10 00:00:00 | 220600.000000 | 3.0 | 2.50 | 1490 | 8102 | 2.0 | 0 | 0 | 4 | 1490 | 0 | 1990 | 0 | 18717 SE 258th St | Covington | WA 98042 | USA |
df.columns
Index(['date', 'price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot',
'floors', 'waterfront', 'view', 'condition', 'sqft_above',
'sqft_basement', 'yr_built', 'yr_renovated', 'street', 'city',
'statezip', 'country'],
dtype='object')
df.sample()
| | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | sqft_above | sqft_basement | yr_built | yr_renovated | street | city | statezip | country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2044 | 2014-06-06 00:00:00 | 185000.0 | 3.0 | 1.0 | 1840 | 8100 | 1.0 | 0 | 0 | 4 | 920 | 920 | 1953 | 1983 | 16433 12th Ave SW | Burien | WA 98166 | USA |
df.sample(10)
| | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | sqft_above | sqft_basement | yr_built | yr_renovated | street | city | statezip | country |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 384 | 2014-05-08 00:00:00 | 3.219500e+05 | 2.0 | 1.25 | 860 | 1277 | 2.0 | 0 | 0 | 3 | 860 | 0 | 2007 | 0 | 2113 15th Ave S | Seattle | WA 98144 | USA |
| 2224 | 2014-06-10 00:00:00 | 4.030000e+05 | 2.0 | 1.00 | 1100 | 3598 | 1.0 | 0 | 0 | 4 | 1100 | 0 | 1926 | 1993 | 3242 15th Ave S | Seattle | WA 98144 | USA |
| 4381 | 2014-05-12 00:00:00 | 5.719861e+05 | 3.0 | 2.50 | 3720 | 11610 | 2.0 | 0 | 0 | 3 | 3720 | 0 | 1982 | 0 | 21730 NE 29th St | Sammamish | WA 98074 | USA |
| 2937 | 2014-06-20 00:00:00 | 6.659000e+05 | 4.0 | 2.25 | 2870 | 5453 | 2.0 | 0 | 1 | 4 | 2220 | 650 | 1926 | 1993 | 10030 44th Ave SW | Seattle | WA 98146 | USA |
| 4588 | 2014-07-08 00:00:00 | 0.000000e+00 | 4.0 | 2.25 | 2890 | 18226 | 3.0 | 1 | 4 | 3 | 2890 | 0 | 1984 | 0 | 3227-3399 Mountain View Ave N | Renton | WA 98056 | USA |
| 1827 | 2014-06-03 00:00:00 | 5.750000e+05 | 4.0 | 2.75 | 3120 | 7644 | 2.0 | 0 | 0 | 3 | 3120 | 0 | 2010 | 0 | 9423 Ash Ave SE | Snoqualmie | WA 98065 | USA |
| 3492 | 2014-06-26 00:00:00 | 2.195000e+05 | 3.0 | 1.00 | 1090 | 6710 | 1.5 | 0 | 0 | 5 | 1090 | 0 | 1912 | 0 | 116 J St SE | Auburn | WA 98002 | USA |
| 1448 | 2014-05-28 00:00:00 | 5.550000e+05 | 3.0 | 2.50 | 3160 | 4270 | 2.0 | 0 | 0 | 3 | 2650 | 510 | 2006 | 0 | 11131 NE 162nd St | Bothell | WA 98011 | USA |
| 836 | 2014-05-16 00:00:00 | 3.300000e+05 | 3.0 | 1.50 | 1170 | 4950 | 1.0 | 0 | 0 | 4 | 1170 | 0 | 1960 | 2001 | 15421 SE 4th Pl | Bellevue | WA 98007 | USA |
| 1617 | 2014-05-30 00:00:00 | 1.365000e+06 | 3.0 | 2.50 | 2090 | 6000 | 1.5 | 0 | 0 | 4 | 2090 | 0 | 1928 | 0 | 3832 43rd Ave NE | Seattle | WA 98105 | USA |
df.isnull().sum() # check for missing values
date             0
price            0
bedrooms         0
bathrooms        0
sqft_living      0
sqft_lot         0
floors           0
waterfront       0
view             0
condition        0
sqft_above       0
sqft_basement    0
yr_built         0
yr_renovated     0
street           0
city             0
statezip         0
country          0
dtype: int64
DATA CLEANING
Handle missing values, outliers, and inconsistencies in the data. This may involve imputing or removing problematic data points.
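Although isnull() reports no missing values, the df.sample(10) output shown earlier includes a listing with a price of 0. The sketch below illustrates both options mentioned here, removal and imputation, on a handful of prices copied from that sample. Treating a zero price as a placeholder is an assumption, not something the dataset documents, and the frame name prices is hypothetical.

```python
import pandas as pd

# Prices copied from the df.sample(10) output (one of them is 0).
prices = pd.DataFrame({"price": [321950.0, 403000.0, 665900.0, 0.0, 575000.0,
                                 219500.0, 555000.0, 330000.0, 1365000.0]})

# Assumption: a sale price of 0 is a placeholder, not a real transaction.
suspect = prices["price"] <= 0
print("suspect rows:", int(suspect.sum()))

# Option 1: removal of the suspect rows.
removed = prices.loc[~suspect].reset_index(drop=True)

# Option 2: imputation with the median of the remaining prices.
imputed = prices.copy()
imputed["price"] = imputed["price"].mask(suspect)  # suspect values -> NaN
imputed["price"] = imputed["price"].fillna(imputed["price"].median())

print(removed.shape, imputed.shape)
```

Removal keeps only observed prices, while imputation keeps the row count; which one is appropriate depends on how the target column will be used later.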
df.info() # print a concise summary: column names, non-null counts, dtypes, and memory usage
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4600 entries, 0 to 4599
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   date           4600 non-null   object
 1   price          4600 non-null   float64
 2   bedrooms       4600 non-null   float64
 3   bathrooms      4600 non-null   float64
 4   sqft_living    4600 non-null   int64
 5   sqft_lot       4600 non-null   int64
 6   floors         4600 non-null   float64
 7   waterfront     4600 non-null   int64
 8   view           4600 non-null   int64
 9   condition      4600 non-null   int64
 10  sqft_above     4600 non-null   int64
 11  sqft_basement  4600 non-null   int64
 12  yr_built       4600 non-null   int64
 13  yr_renovated   4600 non-null   int64
 14  street         4600 non-null   object
 15  city           4600 non-null   object
 16  statezip       4600 non-null   object
 17  country        4600 non-null   object
dtypes: float64(4), int64(9), object(5)
memory usage: 647.0+ KB
df.describe() # compute summary statistics for all numeric columns in the DataFrame
| | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | sqft_above | sqft_basement | yr_built | yr_renovated |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4.600000e+03 | 4600.000000 | 4600.000000 | 4600.000000 | 4.600000e+03 | 4600.000000 | 4600.000000 | 4600.000000 | 4600.000000 | 4600.000000 | 4600.000000 | 4600.000000 | 4600.000000 |
| mean | 5.519630e+05 | 3.400870 | 2.160815 | 2139.346957 | 1.485252e+04 | 1.512065 | 0.007174 | 0.240652 | 3.451739 | 1827.265435 | 312.081522 | 1970.786304 | 808.608261 |
| std | 5.638347e+05 | 0.908848 | 0.783781 | 963.206916 | 3.588444e+04 | 0.538288 | 0.084404 | 0.778405 | 0.677230 | 862.168977 | 464.137228 | 29.731848 | 979.414536 |
| min | 0.000000e+00 | 0.000000 | 0.000000 | 370.000000 | 6.380000e+02 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 370.000000 | 0.000000 | 1900.000000 | 0.000000 |
| 25% | 3.228750e+05 | 3.000000 | 1.750000 | 1460.000000 | 5.000750e+03 | 1.000000 | 0.000000 | 0.000000 | 3.000000 | 1190.000000 | 0.000000 | 1951.000000 | 0.000000 |
| 50% | 4.609435e+05 | 3.000000 | 2.250000 | 1980.000000 | 7.683000e+03 | 1.500000 | 0.000000 | 0.000000 | 3.000000 | 1590.000000 | 0.000000 | 1976.000000 | 0.000000 |
| 75% | 6.549625e+05 | 4.000000 | 2.500000 | 2620.000000 | 1.100125e+04 | 2.000000 | 0.000000 | 0.000000 | 4.000000 | 2300.000000 | 610.000000 | 1997.000000 | 1999.000000 |
| max | 2.659000e+07 | 9.000000 | 8.000000 | 13540.000000 | 1.074218e+06 | 3.500000 | 1.000000 | 4.000000 | 5.000000 | 9410.000000 | 4820.000000 | 2014.000000 | 2014.000000 |
df.describe().transpose()
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| price | 4600.0 | 551962.988473 | 563834.702547 | 0.0 | 322875.00 | 460943.461539 | 654962.50 | 26590000.0 |
| bedrooms | 4600.0 | 3.400870 | 0.908848 | 0.0 | 3.00 | 3.000000 | 4.00 | 9.0 |
| bathrooms | 4600.0 | 2.160815 | 0.783781 | 0.0 | 1.75 | 2.250000 | 2.50 | 8.0 |
| sqft_living | 4600.0 | 2139.346957 | 963.206916 | 370.0 | 1460.00 | 1980.000000 | 2620.00 | 13540.0 |
| sqft_lot | 4600.0 | 14852.516087 | 35884.436145 | 638.0 | 5000.75 | 7683.000000 | 11001.25 | 1074218.0 |
| floors | 4600.0 | 1.512065 | 0.538288 | 1.0 | 1.00 | 1.500000 | 2.00 | 3.5 |
| waterfront | 4600.0 | 0.007174 | 0.084404 | 0.0 | 0.00 | 0.000000 | 0.00 | 1.0 |
| view | 4600.0 | 0.240652 | 0.778405 | 0.0 | 0.00 | 0.000000 | 0.00 | 4.0 |
| condition | 4600.0 | 3.451739 | 0.677230 | 1.0 | 3.00 | 3.000000 | 4.00 | 5.0 |
| sqft_above | 4600.0 | 1827.265435 | 862.168977 | 370.0 | 1190.00 | 1590.000000 | 2300.00 | 9410.0 |
| sqft_basement | 4600.0 | 312.081522 | 464.137228 | 0.0 | 0.00 | 0.000000 | 610.00 | 4820.0 |
| yr_built | 4600.0 | 1970.786304 | 29.731848 | 1900.0 | 1951.00 | 1976.000000 | 1997.00 | 2014.0 |
| yr_renovated | 4600.0 | 808.608261 | 979.414536 | 0.0 | 0.00 | 0.000000 | 1999.00 | 2014.0 |
df["date"]=pd.to_datetime(df["date"])
df.dtypes
date             datetime64[ns]
price                   float64
bedrooms                float64
bathrooms               float64
sqft_living               int64
sqft_lot                  int64
floors                  float64
waterfront                int64
view                      int64
condition                 int64
sqft_above                int64
sqft_basement             int64
yr_built                  int64
yr_renovated              int64
street                   object
city                     object
statezip                 object
country                  object
dtype: object
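Once date is a datetime column, its parts can be pulled out through the .dt accessor. The sketch below is one possible follow-up, not a step from the original analysis; the frame dates and the sale_* column names are hypothetical, and the three timestamps are copied from the tables above.

```python
import pandas as pd

# Three date strings copied from the outputs above.
dates = pd.DataFrame({"date": ["2014-05-02 00:00:00",
                               "2014-06-06 00:00:00",
                               "2014-07-10 00:00:00"]})

# Same conversion as in the cell above: object (string) -> datetime64.
dates["date"] = pd.to_datetime(dates["date"])

# Hypothetical derived columns for later grouping or modelling.
dates["sale_year"] = dates["date"].dt.year
dates["sale_month"] = dates["date"].dt.month
dates["sale_weekday"] = dates["date"].dt.day_name()

print(dates.dtypes)
print(dates)
```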
df.duplicated().sum()
np.int64(0)
df.groupby("sqft_lot")[["bedrooms","bathrooms","floors","view","condition"]].sum().sort_values(by="sqft_lot",ascending=False).head(15)
| sqft_lot | bedrooms | bathrooms | floors | view | condition |
|---|---|---|---|---|---|
| 1074218 | 5.0 | 3.25 | 1.5 | 0 | 5 |
| 641203 | 2.0 | 2.00 | 2.0 | 0 | 3 |
| 478288 | 3.0 | 1.75 | 1.5 | 3 | 4 |
| 435600 | 6.0 | 5.50 | 3.5 | 3 | 5 |
| 423838 | 2.0 | 1.00 | 1.0 | 2 | 5 |
| 389126 | 3.0 | 1.00 | 1.5 | 0 | 4 |
| 327135 | 3.0 | 2.50 | 2.0 | 0 | 3 |
| 307752 | 7.0 | 8.00 | 3.0 | 4 | 3 |
| 306848 | 3.0 | 1.00 | 1.0 | 0 | 3 |
| 284011 | 4.0 | 4.50 | 2.0 | 0 | 4 |
| 280962 | 2.0 | 1.75 | 2.0 | 2 | 3 |
| 265716 | 4.0 | 1.75 | 1.0 | 0 | 4 |
| 258746 | 5.0 | 3.00 | 1.5 | 0 | 4 |
| 251341 | 3.0 | 2.00 | 2.0 | 0 | 3 |
| 250470 | 3.0 | 1.75 | 1.0 | 0 | 4 |
DataFrame showing the sum of bedrooms, bathrooms, floors, view, and condition for each unique sqft_lot, sorted in descending order by sqft_lot, and displaying the top 15 rows.
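Grouping is most informative when the key repeats across many rows. As a hedged illustration of the same groupby pattern with a categorical key, the sketch below groups the ten rows from the head(10) output by city and reports a count and median price per city. The frame name listings and the choice of aggregations are assumptions made for the example.

```python
import pandas as pd

# city and price copied from the ten rows of the head(10) output.
listings = pd.DataFrame({
    "city": ["Shoreline", "Seattle", "Kent", "Bellevue", "Redmond",
             "Seattle", "Redmond", "Maple Valley", "North Bend", "Seattle"],
    "price": [313000.0, 2384000.0, 342000.0, 420000.0, 550000.0,
              490000.0, 335000.0, 482000.0, 452500.0, 640000.0],
})

# One row per city: how many listings it has and their median price.
summary = (
    listings.groupby("city")["price"]
    .agg(["count", "median"])
    .sort_values(by="count", ascending=False)
)
print(summary)
```

The median is used here instead of the sum because it is less sensitive to a single very expensive listing.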
Data Visualisation
num_col=df[df.dtypes[df.dtypes != 'object'].index]
num_col
plt.figure(figsize=(15,10))
sns.heatmap(num_col.corr(),annot=True)
<Axes: >
The code generates a heatmap showing the correlation matrix of the numeric columns in the DataFrame df, with correlation coefficients annotated on the heatmap.
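Each cell of the heatmap is a Pearson correlation coefficient, which is what DataFrame.corr() computes by default. The sketch below recomputes one such coefficient by hand with NumPy and compares it with pandas, using five sqft_living and price values from the head() output. The frame name sample is hypothetical, and five rows are far too few to say anything about the full dataset.

```python
import numpy as np
import pandas as pd

# sqft_living and price copied from the first five rows of the dataset.
sample = pd.DataFrame({
    "sqft_living": [1340, 3650, 1930, 2000, 1940],
    "price": [313000.0, 2384000.0, 342000.0, 420000.0, 550000.0],
})

x = sample["sqft_living"].to_numpy(dtype=float)
y = sample["price"].to_numpy(dtype=float)

# Pearson r: covariance of the centred values divided by the product of their spreads.
xc, yc = x - x.mean(), y - y.mean()
r_manual = (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

# The same quantity via pandas.
r_pandas = sample["sqft_living"].corr(sample["price"])

print(r_manual, r_pandas)
```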
sns.set(rc={'figure.figsize':(12,8)})
sns.distplot(df['bathrooms'],kde=False,bins=20); # histogram only: kde=False turns off the kernel density estimate
This generates a histogram of the bathrooms column with 20 bins and no kernel density estimate (KDE) curve. The figure size is set to 12x8 inches, as specified. Note that distplot is deprecated in recent seaborn releases in favour of histplot and displot.
sns.kdeplot(df['bathrooms']) # kernel density estimation
<Axes: xlabel='bathrooms', ylabel='Density'>
A KDE plot of the bathrooms column, illustrating the estimated probability density function of the data distribution.
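To make the idea of a kernel density estimate concrete, here is a minimal NumPy sketch of a Gaussian KDE: one bell curve is centred on each observation and the curves are averaged. The helper gaussian_kde_1d and the bandwidth of 0.4 are assumptions chosen for illustration (seaborn selects its bandwidth automatically), and the ten bathroom values come from the head(10) output.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian kernels centred on each sample, evaluated on grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # z[i, j] = standardised distance from grid point i to sample j
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
    return kernels.mean(axis=1) / bandwidth

# bathrooms values copied from the first ten rows of the dataset.
bathrooms = np.array([1.5, 2.5, 2.0, 2.25, 2.5, 1.0, 2.0, 2.5, 2.5, 2.0])

grid = np.linspace(-2.0, 6.0, 801)
density = gaussian_kde_1d(bathrooms, grid, bandwidth=0.4)  # bandwidth is an arbitrary choice

# A density should integrate to roughly 1 over a wide enough grid.
area = float(np.sum(density) * (grid[1] - grid[0]))
print(round(area, 4))
```

A smaller bandwidth gives a bumpier curve that follows individual values; a larger one gives a smoother curve that may hide structure.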
df['bathrooms'].describe()
count    4600.000000
mean        2.160815
std         0.783781
min         0.000000
25%         1.750000
50%         2.250000
75%         2.500000
max         8.000000
Name: bathrooms, dtype: float64
Summary statistics including count, mean, standard deviation, minimum, 25th percentile, median (50th percentile), 75th percentile, and maximum values of the bathrooms column.
sns.set(rc={'figure.figsize':(12,8)})
sns.distplot(df['bedrooms'],bins=20);
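A histogram of the bedrooms column with 20 bins, displaying the distribution of the data. Because kde is left at its default here, distplot also overlays a kernel density estimate curve on the bars.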
Histogram
In exploratory data analysis (EDA), a histogram is a graphical representation that shows the distribution of a dataset. It displays the frequency of data points falling within specified ranges, or bins. By plotting a histogram, you can quickly see patterns such as the shape of the distribution, its central tendency, and its spread. It helps in identifying skewness, outliers, and the overall distribution of the data, which is essential for understanding the underlying patterns and making informed decisions.
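The binning behind a histogram can be reproduced without any plotting. The sketch below counts how many values fall in each bin using np.histogram, with bins of width 1 centred on whole numbers; the ten bedroom counts are copied from the head(10) output, and the edge placement is a choice made for this example.

```python
import numpy as np

# bedrooms values copied from the first ten rows of the dataset.
bedrooms = np.array([3, 5, 3, 3, 4, 2, 2, 4, 3, 4], dtype=float)

# Bin edges of width 1, centred on whole numbers: 1.5, 2.5, ..., 5.5.
edges = np.arange(bedrooms.min() - 0.5, bedrooms.max() + 1.0, 1.0)

# counts[i] is the number of values with edges[i] <= value < edges[i + 1]
# (the last bin also includes its right edge).
counts, edges = np.histogram(bedrooms, bins=edges)

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"[{left}, {right}): {n}")
```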
plt.figure(figsize=(7,3))
sns.histplot(x="bedrooms",data=df,color="red",binwidth=1)
<Axes: xlabel='bedrooms', ylabel='Count'>
A red histogram of the bedrooms column with bin width of 1, displayed in a 7x3 inch figure.
plt.figure(figsize=(10,5))
sns.countplot(x='bedrooms',data=df)
<Axes: xlabel='bedrooms', ylabel='count'>
A bar plot showing the count of each unique value in the bedrooms column, displayed in a 10x5 inch figure.
plt.figure(figsize=(10,5))
sns.countplot(x='bedrooms',data=df)
plt.xticks(rotation=90)
([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [Text(0, 0, '0.0'), Text(1, 0, '1.0'), Text(2, 0, '2.0'), Text(3, 0, '3.0'), Text(4, 0, '4.0'), Text(5, 0, '5.0'), Text(6, 0, '6.0'), Text(7, 0, '7.0'), Text(8, 0, '8.0'), Text(9, 0, '9.0')])
This code generates a bar plot of counts for each unique value in the bedrooms column from the DataFrame df, with the figure size set to 10x5 inches and x-axis labels rotated 90 degrees.
Scatterplot
A scatter plot is a two-dimensional data visualization that uses dots to represent the values of two different variables: one is plotted along the x-axis and the other along the y-axis.
plt.scatter(x="bedrooms",y="bathrooms",data=df)
plt.xlabel("Bedrooms")
plt.ylabel("Bathrooms")
plt.title("Scatter plot of Bedrooms vs Bathrooms")
plt.show()
A scatter plot of bedrooms vs bathrooms with labeled axes and a title, showing the relationship between the two variables.
sns.scatterplot(x="bedrooms",y="bathrooms",data=df)
<Axes: xlabel='bedrooms', ylabel='bathrooms'>
A scatter plot showing the relationship between bedrooms and bathrooms.
sns.regplot(x="bedrooms",y="bathrooms",data=df,scatter=True,fit_reg=True)
<Axes: xlabel='bedrooms', ylabel='bathrooms'>
This code generates a scatter plot of bedrooms versus bathrooms with a fitted regression line, showing both the data points and the regression line.
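With its default settings, the line drawn by regplot is an ordinary least-squares straight-line fit. As a sketch of what that fit involves, the example below estimates slope and intercept with np.polyfit and checks the slope against the closed-form covariance-over-variance formula. The ten bedroom and bathroom values come from the head(10) output, so the numbers will differ from a fit on the full dataset.

```python
import numpy as np

# bedrooms (x) and bathrooms (y) copied from the first ten rows of the dataset.
x = np.array([3, 5, 3, 3, 4, 2, 2, 4, 3, 4], dtype=float)
y = np.array([1.5, 2.5, 2.0, 2.25, 2.5, 1.0, 2.0, 2.5, 2.5, 2.0])

# Degree-1 polynomial fit: y is approximated by slope * x + intercept.
slope, intercept = np.polyfit(x, y, 1)

# Closed-form least-squares slope for comparison.
slope_closed = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
intercept_closed = y.mean() - slope_closed * x.mean()

print(slope, intercept)
print(slope_closed, intercept_closed)
```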
sns.regplot(x="bedrooms",y="bathrooms",data=df,scatter=True,fit_reg=False)
<Axes: xlabel='bedrooms', ylabel='bathrooms'>
import pandas as pd
col_names = ['sepal_length','sepal_width','petal_length','petal_width','species']
csv_url = 'data (1).csv'
flowers_df = pd.read_csv(csv_url, names = col_names)
flowers_df.columns
Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
'species'],
dtype='object')
flowers_df['species'].unique()
array(['country', 'USA'], dtype=object)
plt.plot(flowers_df.sepal_length,flowers_df.sepal_width)
plt.show()
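A line plot of the columns labelled sepal_length and sepal_width. Note that flowers_df was read from the housing file 'data (1).csv' with iris column names, which is why species contains 'country' and 'USA'; the labels therefore do not describe iris measurements.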
plt.plot(flowers_df.petal_length,flowers_df.petal_width);
plt.show()
A line plot of petal_length vs petal_width, showing the trend between these two variables.
sns.scatterplot(x=flowers_df.sepal_length,y=flowers_df.sepal_width);
plt.show()
This code generates a scatter plot of sepal_length versus sepal_width from the flowers_df DataFrame, displaying the individual data points for these two variables.
sns.scatterplot(x=flowers_df.petal_length,y=flowers_df.petal_width);
plt.show()
A scatter plot of petal_length versus petal_width, showing the distribution of data points for these variables.
sns.pairplot(df)
plt.show()
A matrix of scatter plots and histograms showing pairwise relationships and distributions of all numeric variables in the DataFrame.
Boxplots
A boxplot, also known as a box-and-whisker plot, is a graphical representation used to visualize the distribution and summary statistics of a dataset.
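The summary statistics behind a boxplot can be computed directly. The sketch below derives the quartiles, the interquartile range (IQR), and the 1.5 x IQR fences that matplotlib and seaborn use by default to decide which points are drawn as outliers. The ten prices come from the head(10) output; results on the full dataset will differ.

```python
import numpy as np

# price values copied from the first ten rows of the dataset.
price = np.array([313000.0, 2384000.0, 342000.0, 420000.0, 550000.0,
                  490000.0, 335000.0, 482000.0, 452500.0, 640000.0])

# Box: first quartile, median, third quartile.
q1, median, q3 = np.percentile(price, [25, 50, 75])
iqr = q3 - q1

# Fences: points beyond these are drawn individually as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = price[(price < lower_fence) | (price > upper_fence)]

print(q1, median, q3, iqr)
print(lower_fence, upper_fence, outliers)
```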
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
sns.boxplot(y='bedrooms',data=df, width=0.2)
<Axes: ylabel='bedrooms'>
A vertical box plot of bedrooms with a box width of 0.2, showing the distribution and summary statistics of the data.
df.hist(bins=50, figsize=(15, 15))
array([[<Axes: title={'center': 'date'}>,
<Axes: title={'center': 'price'}>,
<Axes: title={'center': 'bedrooms'}>,
<Axes: title={'center': 'bathrooms'}>],
[<Axes: title={'center': 'sqft_living'}>,
<Axes: title={'center': 'sqft_lot'}>,
<Axes: title={'center': 'floors'}>,
<Axes: title={'center': 'waterfront'}>],
[<Axes: title={'center': 'view'}>,
<Axes: title={'center': 'condition'}>,
<Axes: title={'center': 'sqft_above'}>,
<Axes: title={'center': 'sqft_basement'}>],
[<Axes: title={'center': 'yr_built'}>,
<Axes: title={'center': 'yr_renovated'}>, <Axes: >, <Axes: >]],
dtype=object)
A grid of histograms for all numeric columns in the DataFrame, with 50 bins each, displayed in a 15x15 inch figure.
sns.boxplot(x='bedrooms',y='price',data=df)
<Axes: xlabel='bedrooms', ylabel='price'>
plt.figure(figsize=(12,5))
sns.boxplot(data=df)
<Axes: >
for col in df.select_dtypes(include="number").columns: # box plots are only meaningful for numeric columns
    plt.figure(figsize=(10, 6)) # new 10x6 figure for each column
    sns.boxplot(x=df[col])
    plt.title(f"Distribution of {col}")
    plt.xlabel(col)
    plt.xticks(rotation=45)
    plt.show()
Individual box plots for each numeric column in the DataFrame, with titles, labeled x-axes, and rotated x-axis labels, displayed one at a time in a 10x6 inch figure.
Conclusion of House Price Prediction Data Analysis
The house price prediction dataset enables the development of models to estimate property values based on various features and historical data.